129 research outputs found
Evaluating Emotional Nuances in Dialogue Summarization
Automatic dialogue summarization is a well-established task that aims to
identify the most important content from human conversations to create a short
textual summary. Despite recent progress in the field, we show that most of the
research has focused on summarizing the factual information, leaving aside the
affective content, which can nonetheless convey useful information to analyse, monitor,
or support human interactions. In this paper, we propose and evaluate a set of
measures to quantify how much emotion is preserved in dialogue summaries.
Results show that state-of-the-art summarization models do not preserve
the emotional content well in their summaries. We also show that by reducing the
training set to only emotional dialogues, the emotional content is better
preserved in the generated summaries, while conserving the most salient factual
information.
Cross-domain Voice Activity Detection with Self-Supervised Representations
Voice Activity Detection (VAD) aims at detecting speech segments in an audio
signal, which is a necessary first step for many of today's speech-based
applications. Current state-of-the-art methods focus on training a neural
network exploiting features directly contained in the acoustics, such as Mel
Filter Banks (MFBs). Such methods therefore require an extra normalisation step
to adapt to a new domain where the acoustics are affected, which can be simply
due to a change of speaker, microphone, or environment. In addition, this
normalisation step is usually a rather rudimentary method that has certain
limitations, such as being highly susceptible to the amount of data available
for the new domain. Here, we exploited the crowd-sourced Common Voice (CV)
corpus to show that representations based on Self-Supervised Learning (SSL) can
adapt well to different domains, because they are computed with contextualised
representations of speech across multiple domains. SSL representations also
achieve better results than systems based on hand-crafted representations
(MFBs), and off-the-shelf VADs, with significant improvement in cross-domain
settings.
Can GPT models Follow Human Summarization Guidelines? Evaluating ChatGPT and GPT-4 for Dialogue Summarization
This study explores the capabilities of prompt-driven Large Language Models
(LLMs) like ChatGPT and GPT-4 in adhering to human guidelines for dialogue
summarization. Experiments employed DialogSum (English social conversations)
and DECODA (French call center interactions), testing various prompts,
including prompts from existing literature and those from human summarization
guidelines, as well as a two-step prompt approach. Our findings indicate that
GPT models often produce lengthy summaries and deviate from human summarization
guidelines. However, using human guidelines as an intermediate step shows
promise, outperforming direct word-length constraint prompts in some cases. The
results reveal that GPT models exhibit unique stylistic tendencies in their
summaries. While BERTScores did not dramatically decrease for GPT outputs,
suggesting semantic similarity to human references and specialised pre-trained
models, ROUGE scores reveal grammatical and lexical disparities between
GPT-generated and human-written summaries. These findings shed light on the
capabilities and limitations of GPT models in following human instructions for
dialogue summarization.
From speech to facial activity: towards cross-modal sequence-to-sequence attention networks
Multimodal data sources offer the possibility to capture and model interactions between modalities, leading to an improved understanding of underlying relationships. In this regard, the work presented in this paper explores the relationship between facial muscle movements and speech signals. Specifically, we explore the efficacy of different sequence-to-sequence neural network architectures for the task of predicting Facial Action Coding System Action Units (AUs) from one of two acoustic feature representations extracted from speech signals, namely the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPs) or the Interspeech Computational Paralinguistics Challenge features set (ComParE). Furthermore, these architectures were enhanced by two different attention mechanisms (intra- and inter-attention) and various state-of-the-art network settings to improve prediction performance. Results indicate that a sequence-to-sequence model with inter-attention can achieve on average an Unweighted Average Recall (UAR) of 65.9 % for AU onset, 67.8 % for AU apex (both eGeMAPs), 79.7 % for AU offset and 65.3 % for AU occurrence (both ComParE) detection over all AUs.
2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP)
DOI: 10.1109/MMSP46350.2019
Funding : BMW Group Researc
AV+ EC 2015--the first affect recognition challenge bridging across audio, video, and physiological data
We present the first Audio-Visual+ Emotion recognition Challenge and workshop (AV+EC 2015) aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and physiological emotion analysis. This is the 5th event in the AVEC series, but the very first Challenge that bridges across audio, video and physiological data. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the audio, video and physiological emotion recognition communities, to compare the relative merits of the three approaches to emotion recognition under well-defined and strictly comparable conditions and establish to what extent fusion of the approaches is possible and beneficial. This paper presents the challenge, the dataset and the performance of the baseline system.
The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism
The INTERSPEECH 2013 Computational Paralinguistics Challenge provides for the first time a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and picks up on autism and its manifestations in speech. Finally, emotion is revisited as a task, albeit with a broader range of twelve emotional states overall. In this paper, we describe these four Sub-Challenges, Challenge conditions, baselines, and a new feature set by the openSMILE toolkit, provided to the participants.
Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer,
Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi,
Hugues Salamin, Anna Polychroniou, Fabio Valente, Samuel Kim
REACT2023: The First Multiple Appropriate Facial Reaction Generation Challenge
The Multiple Appropriate Facial Reaction Generation Challenge (REACT2023) is the first competition event focused on evaluating multimedia processing and machine learning techniques for generating human-appropriate facial reactions in various dyadic interaction scenarios, with all participants competing strictly under the same conditions. The goal of the challenge is to provide the first benchmark test set for multi-modal information processing and to foster collaboration among the audio, visual, and audio-visual behaviour analysis and behaviour generation (a.k.a. generative AI) communities, to compare the relative merits of the approaches to automatic appropriate facial reaction generation under different spontaneous dyadic interaction conditions. This paper presents: (i) the novelties, contributions and guidelines of the REACT2023 challenge; (ii) the dataset utilized in the challenge; and (iii) the performance of the baseline systems on the two proposed sub-challenges: Offline Multiple Appropriate Facial Reaction Generation and Online Multiple Appropriate Facial Reaction Generation, respectively. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline-react2023.
AVEC 2016 – Depression, mood, and emotion recognition workshop and challenge
The Audio/Visual Emotion Challenge and Workshop (AVEC 2016) "Depression, Mood and Emotion" will be the sixth competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and physiological depression and emotion analysis, with all participants competing under strictly the same conditions. The goal of the Challenge is to provide a common benchmark test set for multi-modal information processing and to bring together the depression and emotion recognition communities, as well as the audio, video and physiological processing communities, to compare the relative merits of the various approaches to depression and emotion recognition under well-defined and strictly comparable conditions and establish to what extent fusion of the approaches is possible and beneficial. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.